library(readr)
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.2
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ ggplot2 3.2.0     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
diamond <- read_csv("diamonds.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  carat = col_double(),
  cut = col_character(),
  color = col_character(),
  clarity = col_character(),
  depth = col_double(),
  table = col_double(),
  price = col_double(),
  x = col_double(),
  y = col_double(),
  z = col_double()
)
diamond_trim <- subset(diamond, select = c("carat", "x", "y", "z"))
summary(diamond_trim)
     carat            x                y                z         
 Min.   :0.2000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.:0.4000   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
 Median :0.7000   Median : 5.700   Median : 5.710   Median : 3.530
 Mean   :0.7979   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
 3rd Qu.:1.0400   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :5.0100   Max.   :10.740   Max.   :58.900   Max.   :31.800
Load the diamonds.csv data set and undertake an initial exploration of the data. You will find a description of the meanings of the variables on the relevant Kaggle page.
We expect the carat of the diamonds to be strongly correlated with the physical dimensions x, y and z. Use ggpairs() to investigate correlations between these four variables.
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Attaching package: ‘GGally’
The following object is masked from ‘package:dplyr’:
nasa
ggpairs(diamond_trim)
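The correlation matrix behind those panels can also be printed directly (a sketch using base cor() on the trimmed data from above; ggpairs() shows the same values in its upper triangle):

```r
# Numeric view of the pairwise correlations ggpairs() plots,
# assuming diamond_trim defined above
round(cor(diamond_trim), 3)
```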
So, we do find significant correlations. Let’s drop columns x, y and z from the dataset, in preparation to use only carat going forward.
diamond_carat <- diamond[, 1:8]  # drop x, y and z (columns 9-11)
We are interested in developing a regression model for the price of a diamond in terms of the possible predictor variables in the dataset. Use ggpairs() to investigate correlations between price and the predictors (this may take a while to run, don’t worry, make coffee or something).
ggpairs(diamond_carat)
Perform further ggplot visualisations of any significant correlations you find.
diamond_carat %>%
ggplot(aes(x = carat, y = price)) +
geom_point()
diamond_carat %>%
ggplot(aes(x = cut, y = price)) +
geom_boxplot()
diamond_carat %>%
ggplot(aes(x = color, y = price)) +
geom_boxplot()
diamond_carat %>%
ggplot(aes(x = clarity, y = price)) +
geom_boxplot()
Shortly we may try a regression fit using one or more of the categorical predictors cut, clarity and color, so let’s investigate these predictors: Investigate the factor levels of these predictors. How many dummy variables do you expect for each of them?
length(unique(diamond_carat$cut))
[1] 5
length(unique(diamond_carat$color))
[1] 7
length(unique(diamond_carat$clarity))
[1] 8
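With treatment (dummy) coding, a factor with k levels needs k - 1 dummies, since one level serves as the reference, so we expect 4, 6 and 7 dummies respectively. A quick sketch of that count:

```r
# k factor levels -> k - 1 dummy variables (one level is the reference),
# assuming diamond_carat defined above
sapply(diamond_carat[c("cut", "color", "clarity")],
       function(v) length(unique(v)) - 1)
#     cut   color clarity
#       4       6       7
```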
Use the dummy_cols() function in the fastDummies package to generate dummies for these predictors and check the number of dummies in each case.
library(fastDummies)
diamond_carat <- dummy_cols(diamond_carat,
select_columns = c("cut", "color", "clarity"),
remove_first_dummy = TRUE)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
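dummy_cols() names each new column predictor_level, so counting columns with those prefixes confirms the expected 4, 6 and 7 dummies were generated (a sketch, assuming the diamond_carat frame produced above):

```r
# Count the dummy columns that dummy_cols() appended for each predictor
c(cut     = sum(startsWith(names(diamond_carat), "cut_")),
  color   = sum(startsWith(names(diamond_carat), "color_")),
  clarity = sum(startsWith(names(diamond_carat), "clarity_")))
```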
Going forward we'll let R handle dummy-variable creation for categorical predictors in regression fitting (remember lm() will generate the correct number of dummy levels automatically, absorbing one of the levels into the intercept as a reference level). First, we'll start with simple linear regression. Regress price on carat and check the regression diagnostics.
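To see that coding in action, model.matrix() exposes the design matrix lm() will build; with the five cut levels the alphabetically first ("Fair") is absorbed into the intercept as the reference (a sketch):

```r
# Design matrix lm() builds for a single categorical predictor:
# 5 cut levels -> intercept + 4 dummies, with "Fair" as the reference
head(model.matrix(~ cut, data = diamond_carat))
```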
price_carat_model <- lm(price ~ carat,
data = diamond_carat)
summary(price_carat_model)
Call:
lm(formula = price ~ carat, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-18585.3 -804.8 -18.9 537.4 12731.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2256.36 13.06 -172.8 <2e-16 ***
carat 7756.43 14.07 551.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1549 on 53938 degrees of freedom
Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
plot(price_carat_model)
Run a regression with one or both of the predictor and response variables log() transformed and recheck the diagnostics. Do you see any improvement?
diamond_carat <- diamond_carat %>%
  mutate(price_log = log(price),
         carat_log = log(carat))
log_log_price_carat <- lm(price_log ~ carat_log,
data = diamond_carat)
summary(log_log_price_carat)
Call:
lm(formula = price_log ~ carat_log, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-1.50833 -0.16951 -0.00591 0.16637 1.33793
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.448661 0.001365 6190.9 <2e-16 ***
carat_log 1.675817 0.001934 866.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared: 0.933, Adjusted R-squared: 0.933
F-statistic: 7.51e+05 on 1 and 53938 DF, p-value: < 2.2e-16
plot(log_log_price_carat)
log_price <- lm(price_log ~ carat,
data = diamond_carat)
summary(log_price)
Call:
lm(formula = price_log ~ carat, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-6.2844 -0.2449 0.0335 0.2578 1.5642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021 0.003348 1856 <2e-16 ***
carat 1.969757 0.003608 546 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
plot(log_price)
log_carat <- lm(price ~ carat_log,
data = diamond_carat)
summary(log_carat)
Call:
lm(formula = price ~ carat_log, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-6137.4 -1495.1 -328.3 1141.9 12075.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6237.84 10.73 581.2 <2e-16 ***
carat_log 5836.02 15.21 383.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2066 on 53938 degrees of freedom
Multiple R-squared: 0.7319, Adjusted R-squared: 0.7319
F-statistic: 1.473e+05 on 1 and 53938 DF, p-value: < 2.2e-16
plot(log_carat)
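Pulling R² out of the four fits makes the comparison explicit (a sketch, assuming the four model objects above):

```r
# Compare fits across the four transformation choices
sapply(list(raw     = price_carat_model,
            log_log = log_log_price_carat,
            log_y   = log_price,
            log_x   = log_carat),
       function(m) summary(m)$r.squared)
# raw 0.849, log_log 0.933, log_y 0.847, log_x 0.732 -> log-log wins,
# and its residual plots look far healthier
```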
Let’s use log() transformations of both predictor and response. Next, experiment with adding a single categorical predictor into the model. Which categorical predictor is best? [Hint - investigate R² values]
log_log_cut <- lm(price_log ~ carat_log + cut,
data = diamond_carat)
log_log_color <- lm(price_log ~ carat_log + color,
data = diamond_carat)
log_log_clarity <- lm(price_log ~ carat_log + clarity,
data = diamond_carat)
summary(log_log_cut)
Call:
lm(formula = price_log ~ carat_log + cut, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-1.52247 -0.16484 -0.00587 0.16087 1.38115
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.200125 0.006343 1292.69 <2e-16 ***
carat_log 1.695771 0.001910 887.68 <2e-16 ***
cutGood 0.163245 0.007324 22.29 <2e-16 ***
cutIdeal 0.317212 0.006632 47.83 <2e-16 ***
cutPremium 0.238217 0.006716 35.47 <2e-16 ***
cutVery Good 0.240753 0.006779 35.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2545 on 53934 degrees of freedom
Multiple R-squared: 0.9371, Adjusted R-squared: 0.9371
F-statistic: 1.607e+05 on 5 and 53934 DF, p-value: < 2.2e-16
summary(log_log_color)
Call:
lm(formula = price_log ~ carat_log + color, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-1.49987 -0.14888 0.00257 0.15316 1.27815
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.572034 0.003051 2809.531 < 2e-16 ***
carat_log 1.728631 0.001814 952.727 < 2e-16 ***
colorE -0.025460 0.003748 -6.793 1.11e-11 ***
colorF -0.034455 0.003773 -9.132 < 2e-16 ***
colorG -0.055399 0.003653 -15.166 < 2e-16 ***
colorH -0.189859 0.003917 -48.468 < 2e-16 ***
colorI -0.286928 0.004383 -65.467 < 2e-16 ***
colorJ -0.425038 0.005417 -78.466 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2372 on 53932 degrees of freedom
Multiple R-squared: 0.9454, Adjusted R-squared: 0.9454
F-statistic: 1.333e+05 on 7 and 53932 DF, p-value: < 2.2e-16
summary(log_log_clarity)
Call:
lm(formula = price_log ~ carat_log + clarity, data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-0.97521 -0.12085 0.01048 0.12561 1.85854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.768115 0.006940 1119.25 <2e-16 ***
carat_log 1.806424 0.001514 1193.23 <2e-16 ***
clarityIF 1.114625 0.008376 133.07 <2e-16 ***
claritySI1 0.624558 0.007163 87.19 <2e-16 ***
claritySI2 0.479658 0.007217 66.46 <2e-16 ***
clarityVS1 0.820461 0.007306 112.30 <2e-16 ***
clarityVS2 0.775248 0.007197 107.72 <2e-16 ***
clarityVVS1 1.028298 0.007745 132.77 <2e-16 ***
clarityVVS2 0.979221 0.007529 130.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1888 on 53931 degrees of freedom
Multiple R-squared: 0.9654, Adjusted R-squared: 0.9654
F-statistic: 1.879e+05 on 8 and 53931 DF, p-value: < 2.2e-16
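The three candidates line up cleanly by R² (a sketch, assuming the three model objects above):

```r
# Which single categorical predictor adds the most explanatory power?
sapply(list(cut     = log_log_cut,
            color   = log_log_color,
            clarity = log_log_clarity),
       function(m) summary(m)$r.squared)
# cut 0.9371, color 0.9454, clarity 0.9654 -> clarity is best
```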
Interpret the fitted coefficients for the levels of your chosen categorical predictor. Which level is the reference level? Which level shows the greatest difference in price from the reference level? [Hints - remember we are regressing the log(price) here, and think about what the presence of the log(carat) predictor implies. We’re not expecting a mathematical explanation]
unique(diamond_carat$clarity)
[1] "SI2" "SI1" "VS1" "VS2" "VVS2" "VVS1" "I1" "IF"
I1 is the reference level. The largest coefficient belongs to clarityIF (1.114625): for a fixed log(carat), the log(price) of an IF-clarity diamond is 1.114625 higher than that of an I1 diamond, so IF shows the greatest difference in price from the reference level.
Because the response is log(price), a positive coefficient means a higher price than the reference and a negative coefficient would mean a lower one. All the clarity coefficients here are positive, i.e. every clarity level is pricier than I1 at a given carat.
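Since the response is log(price), each level coefficient c translates to multiplying the predicted price by exp(c) at fixed carat; exponentiating makes the effects concrete (a sketch, assuming log_log_clarity from above):

```r
# Multiplicative price effect of each clarity level relative to I1,
# holding carat fixed: predicted price is multiplied by exp(coefficient)
round(exp(coef(log_log_clarity)[-(1:2)]), 2)  # drop intercept and carat_log
# e.g. exp(1.1146) ~ 3.05: an IF diamond prices at roughly three times
# an otherwise-identical I1 diamond
```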
2 Extension
Try adding an interaction between log(carat) and your chosen categorical predictor. Do you think this interaction term is statistically justified?
log_log_caratclarity <- lm(price_log ~ carat_log + clarity + carat_log:clarity,
data = diamond_carat)
summary(log_log_caratclarity)
Call:
lm(formula = price_log ~ carat_log + clarity + carat_log:clarity,
data = diamond_carat)
Residuals:
Min 1Q Median 3Q Max
-0.92773 -0.12104 0.01212 0.12465 1.51830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.808102 0.007223 1080.98 <2e-16 ***
carat_log 1.528106 0.014944 102.25 <2e-16 ***
clarityIF 1.122732 0.011381 98.65 <2e-16 ***
claritySI1 0.587556 0.007465 78.71 <2e-16 ***
claritySI2 0.438797 0.007486 58.62 <2e-16 ***
clarityVS1 0.790989 0.007721 102.45 <2e-16 ***
clarityVS2 0.723109 0.007530 96.03 <2e-16 ***
clarityVVS1 1.007591 0.009506 106.00 <2e-16 ***
clarityVVS2 0.968426 0.008359 115.85 <2e-16 ***
carat_log:clarityIF 0.337116 0.017593 19.16 <2e-16 ***
carat_log:claritySI1 0.288214 0.015254 18.89 <2e-16 ***
carat_log:claritySI2 0.258795 0.015437 16.76 <2e-16 ***
carat_log:clarityVS1 0.300307 0.015395 19.51 <2e-16 ***
carat_log:clarityVS2 0.250187 0.015237 16.42 <2e-16 ***
carat_log:clarityVVS1 0.301955 0.016317 18.51 <2e-16 ***
carat_log:clarityVVS2 0.321665 0.015717 20.47 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1877 on 53924 degrees of freedom
Multiple R-squared: 0.9658, Adjusted R-squared: 0.9658
F-statistic: 1.014e+05 on 15 and 53924 DF, p-value: < 2.2e-16
Hard to justify in practice: every interaction term has a tiny p-value (almost inevitable with n ≈ 54,000 observations), but R² only rises from 0.9654 to 0.9658 and the residual standard error barely moves, so the seven extra parameters add essentially no explanatory power.
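A formal nested-model comparison via anova() gives the F-test for the interaction terms as a group (a sketch, assuming the two model objects above; at this sample size the p-value will be tiny even though the practical gain is negligible, which is why the p-value alone shouldn't decide):

```r
# Nested-model F-test: do the interaction terms reduce residual error
# beyond chance? Judge practical value from R^2 / residual SE as well.
anova(log_log_clarity, log_log_caratclarity)
```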
Find and plot an appropriate visualisation to show the effect of this interaction
coplot(price_log ~ carat_log | clarity,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")
},
overlap = 0.2,
data = diamond_carat)
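An equivalent tidyverse view (a sketch, assuming diamond_carat with its log columns from above): facet the log-log scatter by clarity and let geom_smooth() fit a line per panel; the varying slopes across panels are the interaction.

```r
# Per-clarity regression lines: differing slopes show the interaction
diamond_carat %>%
  ggplot(aes(x = carat_log, y = price_log)) +
  geom_point(alpha = 0.1, size = 0.3) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  facet_wrap(~ clarity)
```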